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Abstract 

Background: Continuing research into the global multiple sequence alignment problem has resulted in more 
sophisticated and principled alignment methods. Unfortunately these new algorithms often require large amounts 
of time and memory to run, making it nearly impossible to run these algorithms on large datasets. As a solution, 
we present two general methods, Crumble and Prune, for breaking a phylogenetic alignment problem into smaller, 
more tractable sub-problems. We call Crumble and Prune meta-alignment methods because they use existing 
alignment algorithms and can be used with many current alignment programs. Crumble breaks long alignment 
problems into shorter sub-problems. Prune divides the phylogenetic tree into a collection of smaller trees to 
reduce the number of sequences in each alignment problem. These methods are orthogonal: they can be applied 
together to provide better scaling in terms of sequence length and in sequence depth. Both methods partition the 
problem such that many of the sub-problems can be solved independently. The results are then combined to 
form a solution to the full alignment problem. 

Results: Crumble and Prune each provide a significant performance improvement with little loss of accuracy. 
In some cases, a gain in accuracy was observed. Crumble and Prune were tested on real and simulated data. 
Furthermore, we have implemented a system called Job-tree that allows hierarchical sub-problems to be solved 
in parallel on a compute cluster, significantly shortening the run-time. 

Conclusions: These methods enabled us to solve gigabase alignment problems. These methods could enable a 
new generation of biologically realistic alignment algorithms to be applied to real world, large scale alignment 
problems. 



Background 

Multiple sequence alignment methods are a major tool 
in comparative genomics. The alignments they generate 
are primary data for a wide array of analyses: discovery 
of evolutionarily conserved elements [1,2], identification 
of functional RNAs [3], reconstruction of evolutionary 
events [4] to name a few. 

New high throughput sequencing methods are greatly 
increasing the amount of available sequence informa- 
tion. To take full advantage of this wealth of data, cur- 
rent multiple sequence alignment methods will need to 
be adapted to handle larger datasets. 
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We present two general methods to adapt current glo- 
bal alignment algorithms to large scale problems. This 
will enable current methods to be used for larger and 
larger problems and also allow computationally expen- 
sive methods to be applied to biologically relevant 
problems. 

Related Work 

Segmentation methods have been used to align full 
genomes. To align the mouse genome to the human 
genome, Schwartz et al. divided the human genome into 
approximately 3,000 segments of ~ 1.01 Mb with a 10 
kb overlap between adjacent segments [5]. They 
hypothesize that any alignment that extends for 10 kb is 
almost certain to contain an alignment that would 
bridge the adjacent segments and thus no alignment will 
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be lost by the segmentation. Crumble is a more prin- 
cipled segmentation method that, given some assump- 
tions, guarantees that each segment contains all the 
sequence necessary for correct alignment. 

Dress et al. describe a divide and conquer method that 
recursively subdivides three sequence alignment pro- 
blems length-wise to produce shorter alignment pro- 
blems [6]. They outline how their method might be 
extended to more sequences but since it is based on 
considering all pairs of sequences, it is not applicable to 
beyond moderately sized problems. Reinert et al. use the 
above method in an iterative framework that allows the 
an alignment to be refined until the alignment score 
stops improving or time runs out [7]. This methods is 
also too computationally expensive to apply to large 
datasets. 

The MISHIMA algorithm of Kryukov and Saitou 
starts by finding a non-conflicting set of k-mers shared 
by all sequences [8]. The sequences between these k- 
mers are then aligned independently and concatenated 
to form the complete alignment. Crumble uses a more 
general method that does not require the constraints to 
span all sequences. Thus Crumble can be used to align 
more diverged sequences where it is impossible to find 
/c-mers shared by all sequences. Crumble also takes 
advantage of parallelization while MISHIMA currently 
does not. 

The method employed by Prune can be considered a 
generalization of the progressive alignment methodology 
[9,10]. Progressive alignment proceeds by merging sib- 
ling nodes to form the alignment of the parent. Thus it 
only moves up the tree one node at a time. Prune, on 
the other hand, can move up the tree several nodes at a 
time. Progressive alignment also uses leaf sequences to 
guide the merging process while Prune uses a more gen- 
eral strategy. 

Several authors have extended specific alignment algo- 
rithms to take advantage of parallelization, usually 
thread or intra-core parallelization [11-13]. These meth- 
ods are not designed to take advantage of cluster level 
parallelization. Furthermore they are specific to a given 
alignment program. The open framework we present 
here is designed to leverage parallelization for a general 
class of multiple alignment programs. Thus it can be 
used to enable the parallelization of many current and 
future global alignment algorithms. 

Implementation 

We adopt the general approach of dividing a large align- 
ment problem into smaller sub-problems. The sub-pro- 
blems are solved and the alignments are recombined to 
form a solution to the whole, original problem. We per- 
form this division in such a way that most sub-problems 
are independent and thus can be solved in parallel. This 



can lead to reduced run-time because of smaller indivi- 
dual problems and because sub-problems can be solved 
in parallel on a multi-core machine or cluster. Smaller 
problems can also result in less memory consumption; 
allowing some methods to be applied at all. 

An alignment problem can be large in two ways: large 
in sequence length or large in sequence number. Crum- 
ble deals with problems that are large in sequence 
length. It breaks up long alignment problems into 
shorter problems. Prune handles problems with a large 
number of sequences. It cuts up deep alignment pro- 
blems into sub-problems with fewer sequences. 

Our methods do not perform the actual alignment of 
sub-problems. Other programs are used for that part of 
the process. Almost any current global alignment algo- 
rithm can be adapted to work with Crumble and Prune 
by writing a simple wrapper. Thus we refer to Crumble 
and Prune as meta- alignment methods. In theory, both 
these methods would result in performance gains at the 
loss of some alignment accuracy. However, we found 
that very significant performance gains can be made 
with negligible loss of alignment accuracy and in a few 
cases some accuracy can be gained. 

Crumble: breaking long alignment problems into shorter 
sub-problems 

Given a set of sequences, we begin by generating a system 
of sequence constraints. These constraints align a base 
position in one sequence with a base position in another 
sequence [14]. Constraints of this form can be thought of 
as a sparse alignment and have been used in several align- 
ment programs [15-17]. The constraints impose a partial 
ordering ■< on the sequences [18]. Under this partial order- 
ing, aligned positions are equal and base positions along 
the same sequence are increasing with respect to the par- 
tial ordering. Given a base position x in a sequence, Crum- 
ble searches for a set of base positions A = \a,\, a 2 , ... , a n } in 
each of the n species being aligned such that X 4 «, for all i 
(Figure 1). We now consider the constraints as a sparse 
alignment and let y be the right-most position in A, i.e. y 
such that di4 y for all i. The sequences between x and y, 
inclusive, forms a separation that divides the alignment 
into three classes: sequences to the left of x; x, y, and the 
sequences between them; and sequences to the right of y. 
No base position less that x in the partial ordering may be 
aligned with a base position greater than y. Such an align- 
ment would contradict the partial ordering imposed by 
the constraints. Note that if X = y, then the separation is 
simply a column of fully aligned sequence in all species. 
Crumble tiles the constraints with a set of positions 

Xi =4 y 1 < x 2 -4 Yi < ■ ■ ■ Xm < Ym, 

where each pair x it y t forms a separation as described 
above. The sequences between y k _ 1 and x k compose a 
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Figure 1 A set of constraints visualized as a sparse alignment. In each species, positions increase from left to right with respect to the 
partial order. Sequence positions in the same column are equal under the partial order. A set of x k , y k separations breaks the sparse alignment 
into a set of blocks. Positions x kA , y kA and x k , y k define block k, which is composed of the core k (light gray) and the adjacent separation (dark 
gray). 



core. As noted above, each core is independent in the 
sense that sequences in one core cannot be aligned with 
sequences in another core. The position of the 
sequences between cores is ambiguous. Both sets of 
adjacent sequences need to be considered when aligning 
a core. Thus we define a block as a core together with 
its flanking separations (Figure 1, bottom). The pairs x t , 
ji are chosen to maintain a user-selected approximate 
core size. The exact size of a block will depend on the 
set of initial constraints. To prevent blocks from becom- 
ing too large, a user-defined maximum block size can 
also be set. If the maximum is reached, the block is 
truncated to the maximum block size. The truncated 
sequence becomes part of the next block. 

Because the blocks are semi-independent, an align- 
ment of all sequences is constructed as follows: each 
block is aligned in parallel using a user-specified global 
alignment method; each aligned block is then trimmed 
until there is no sequence overlap between adjacent 
blocks; then the trimmed off sequences are realigned; 
and, finally, all sub-alignments are concatenated to form 
the full alignment (Figure 2). 

The method used to generate the sparse constraints is 
user-definable. By writing a simple wrapper, any current 
alignment constraint method can be used. If the con- 
straints are a subset of the true alignment, then the 
blocks produced by Crumble contain all the sequence 
necessary for their correct alignment. 

Prune: trimming deep alignment problems into smaller 
sub-problems 

Given a set of sequences and a rooted phylogenetic tree, 
Prune breaks the tree into sub-trees that overlap at an 



internal node. Prune infers a sequence for the root of 
each sub-tree, i.e., the overlapping nodes. Using inferred 
sequence at an internal node breaks the conditional 
dependencies between the sub-trees. In a leaf-to-root 
fashion, Prune aligns the sequences in each sub-tree as 
well as the sequences of any inferred nodes included in 
the sub-tree. A sequence is then inferred for the root of 
the sub-tree. This inferred sequence is used in the align- 
ment of the sequences in the parent's sub-tree. Once all 
sub-tree alignments have been formed, the alignments 
of each sub-tree are merged using the inferred overlap- 
ping root sequence as a guide (Figure 3). To provide 
information about the rest of the tree, an out-group 
sequence from a leaf node closest to the sub-tree root, 
as measured by branch length, is included when aligning 
each sub-tree. Prune forms the sub-trees so as to mini- 
mize the number of stages that must be performed in 
sequence while enforcing a maximum number of 
sequences in each sub-tree, including inferred 
sequences. If the tree nodes are numbered 1, 2, N 
and 7? is the root node, then the minimum number of 
stages needed with a maximum of M sequences per 
sub-tree is given by S R1 . S t j is defined by the following 
recurrence relation: 



i leaf and ; ■ 
i leaf and - 



1, 

2....M, 



1 

oo 

min (max{S; in , S r , ra }) + 1 i internal node and ; = 

n+m<M 

min(max{S;,„, S r ,., 
n+m=j 



1) 



i internal node and; = 2,...,M, 



(1) 
(2) 
(3) 

(4) 



where r, and are the right and left children of inter- 
nal node i respectively. Informally, is the minimum 
number of stages needed if the sub-tree rooted at node i 
has /' sequences (leaf or inferred) in it. The only sub- 
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Figure 2 The Crumble pipeline. The pipeline used after the formation of semi-independent blocks (A). Blocks are aligned (B) and trimmed 
to remove overlap (C). Overlaps are aligned (D), and the final alignment is formed by concatenation (E). Note that the alignments in (B) and (D) 
can be performed in parallel. 



trees rooted at a leaf that make sense are those contain- 
ing one node. Thus Equation (1) assigns all sub-trees 
rooted at leaf i containing only one sequence a stage 
count of 1. This trivial "alignment" of a single sequence 
is counted as a stage since time is spent extracting and 
processing the sequence. Equation (2) assigns all other 
trees below leaf i (but not rooted at i) a stage count of 
°°. Equation (3) calculates the minimum number of 
stages needed if node i is the root of a sub-tree. It does 
this by considering all ways to merge the right (r,) and 
left (/,■) children of node i into a new sub-tree while 
respecting the maximum number of sequences per sub- 
tree, M. It selects the merge that minimizes the number 
of stages at i. The new stage depth becomes one more 
than this minimum. Equation (4) calculates the mini- 
mum number of stages needed if the sub-tree below 
node i (but not rooted at i) contains exactly / sequences, 
leaf or inferred. Thus, Sri is the minimum number of 
stages needed for the entire tree if the global root node 
R is the root of a sub-tree. The above recurrence rela- 
tion can be calculated using a dynamic programming 



algorithm and the sub-trees that achieve the minimum 
number of stages can be inferred. For a given maximum 
number of sequences per sub-tree, the dynamic pro- 
gramming calculation takes time linear in the number 
of species. 

Originally, two methods were used to infer the root 
sequence: Ortheus [19] and Maximal, a heuristic that 
assumed that the most commonly occurring base in an 
alignment column was the base for that position in the 
root. Maximal is such named because it infers the long- 
est possible root sequence that fits within the alignment 
(Figure 4). Surprisingly, we found that Maximal per- 
formed better that Ortheus for this particular application 
(Table 1). It is possible that the Maximal method, while 
less biologically meaningful, provides more opportunities 
for homologous positions to be aligned. Therefore, we 
adopted Maximal as the default inference method and 
performed our benchmarking using it. 

The sub-tree alignment method and the inference 
method are user-definable. The maximum number of 
sequences in each sub-tree is also user-configurable. 
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Figure 3 Prune partitioning of a phylogenetic tree of 44 species. Prune run with a maximum sub-tree size of 10 sequences breaks the tree 
into seven sub-trees. Six of the sub-trees (light gray) can be aligned in parallel because they contain only known leaf sequences (filled squares; 
out-groups for sub-trees are not shown). Once these six sub-trees are aligned and the sequence of the roots (filled circles) is inferred, the 
internal sub-tree (dark gray) can be aligned. Note that this sub-tree includes both leaf sequences and inferred sequences. The alignments from 
the light gray sub-trees are merged with the alignment of the dark gray sub-tree to form the alignment of the entire tree. 

leaf sequences, the addition of out-groups is disabled, 
and parallelization is disabled. 

Job-tree: Solving gigabase alignment problems with 
Prune and Crumble 

Prune and Crumble can be used together to align long 
and deep alignment problems. To take advantage of 
both Prune and Crumble's parallelization, a cluster sys- 
tem with the following functionality is needed: any job 
must be able to spawn its own set of parallel jobs. This 
is because Prune's sub-tree alignment tasks are run in 
parallel and each of those tasks executes Crumble which 
in turn spawns its own set of parallel jobs. This requires 
a hierarchical and dynamic job system. 

Job-tree is a batch system designed to manage jobs in 
a cluster running on top of an existing batch system, 



As discussed above, Prune can be considered a gener- 
alization of progressive alignment. The two are equiva- 
lent if, in Prune, the maximum number of nodes per 
sub-tree is set at two, the inference method selects only 



Species 1 GC TGCCAAT - - CATT 

Species 2 CCAAGCTGCCAATGCCATT 

Species 3 GCTAGCAGC CTT 

Maximal root GCAAGCTGCCAATGCCATT 

Figure 4 Example of the Maximal root inference method. Every 
alignment column is assigned the most frequently occurring base 
in the column. Thus Maximal infers the longest possible root 
sequence that fits within the alignment. 
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Table 1 Comparison between root inference methods 



50 leaves 
Agreement 

Number Nodes Maximal Ortheus 


Prune w/Pecan 


30 


0.880 


0.579 




15 


0.909 


0.560 




7 


0.912 


0.555 


Prune w/FSA 


30 


0.912 


0.574 




15 


0.893 


0.523 




7 


0.885 


0.495 


Prune w/MUSCLE 


30 


0.899 


0.579 




15 


0.896 


0.555 




7 


0.905 


0.501 



The average agreement score of Prune alignments when Maximal and 
Ortheus root inference methods are used. Fifty alignment problems with fifty 
leaf species and -10 kilobases of sequence were used. Three underlying 
alignment algorithms and three different maximum sub-tree sizes were used 
in the comparison. The faster Maximal method performed better across all 
comparisons that Ortheus for this application. 

such as Parasol [20], LSF [21], or Sun Grid Engine [22]. 
Job-tree makes it simple for jobs to dynamically create 
new jobs in a hierarchical fashion (Figure 5). 

Communication between Job-tree and the jobs pro- 
ceeds via XML files that are processed before and after 
each job. Job-tree is currently implemented in Python 
and works on top of the Parasol batch system but can 
be extended to work with other job systems. It also fea- 
tures a serial mode that runs on a single machine, 
which is useful for testing or for small tasks. 

Results and Discussion 

We used two methods to measure the effect that Crum- 
ble and Prune had on alignment accuracy and run-time. 
The first method uses a large collection of alignment 
problems generated by simulation. Alignment problems 
generated by simulation provide "true alignments" that 
can be used to measure the quality of predicted align- 
ments. We created a large collection of simulated data- 
sets to test Crumble and Prune. The second method 
uses real biological data. We tested Crumble by aligning 
sequence from six species to a 90 kb block of the 
human genome. Prune was tested on twelve alignment 
problems from the Rfam database that contain more 
than 200 sequences [23]. Datasets with fewer sequences 
can be easily and quickly aligned by current methods 
and thus are not on the scale of problem that Prune 
was designed to address. 

Simulation Results 

All computations in this section were performed on an 
AMD based cluster. Each compute node had two AMD 
Opteron 246 HE processors running at 2 GHz with 4GB 
of memory. The largest number of parallelizable jobs 
was 26. While hard to guarantee in a shared work 



environment, there were sufficient nodes available to 
allow most parallelizable jobs to be run in parallel. 
Crumble 

We used a published dataset of simulated sequences 
[24]. The dataset contained 50 alignment problems over 
nine species with -60 kilobase of sequence per species. 
These alignments were generated using a model of neu- 
tral evolution and have been used to measure the align- 
ment accuracy of several other methods [24,17,25,16]. 
Using the same simulation program, we generated addi- 
tional longer datasets with -150 kilobases, -500 kilo- 
bases, and -1 megabase of sequence over the same nine 
species tree as used in previous studies. The simulation 
program was independently developed and models sub- 
stitutions using the HKY model [26], deletions and 
insertions along each branch as well as retrotransposon 
insertions. For more details see the supplemental mate- 
rials of Blanchette et al. [24]. The total length of the 
tree was 1.13 substitutions per site. Fifty simulations 
were run for each sequence size. In total, 200 alignment 
problems, totaling -760 megabases of sequences, were 
used to test Crumble. 

To measure the degree of similarity between predicted 
and true alignments, we used the average agreement 
score [24]. For pairwise alignments, the average agree- 
ment score (agreement for short) is as defined the frac- 
tion of positions of the predicted alignment that agree 
with the true alignment. There is agreement if the i-th. 
nucleotide of species X is aligned to the j-th nucleotide 
of species Y in both the predicted and true alignment or 
if the positions are similarly aligned to gap characters. 
Thus, two identical alignments obtain an agreement 
score of 1, whereas two completely different alignments 
get a score of zero. For more than two species, the pair- 
wise score is averaged over all pairs of species. 

We calculated the agreement score resulting from 
Crumble breaking up the alignment problem into sev- 
eral fractional sized pieces (Table 2). We used Pecan 
[17], FSA [25], and MUSCLE [27] as the underlying 
alignment algorithms. The set of sparse constraints used 
to break up the alignment problem was generated with 
PrePecan, the constraint generation system employed by 
Pecan. As the block size decreased, Crumble broke the 
alignment problem into smaller and smaller sub-pro- 
blems that were aligned in parallel using the Parasol job 
system. As the sub-problems decreased in size, the run- 
time decreased, as expected, with only a small loss in 
accuracy. For longer alignment problems, a much more 
pronounced performance gain was observed and, at the 
same time, the alignment accuracy was maintained. 
Some methods were unable to solve long alignment pro- 
blems directly because of high memory usage. Crumble 
allows these methods to be applied to longer problems 
than were previously possible. 
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Figure S Schema of the Job-tree job system. In Job-tree, job a creates a set of jobs that perform a task in parallel. These jobs are collectively 
called the children of job a. The job also creates a follow-on job to be performed after all children have successfully completed. The follow-on 
job is responsible for cleaning up the input files created for the children and for any further processing. After job a ends successfully, the batch 
system runs the children. These jobs may, in turn, have children and follow-on jobs. Upon completion of all descendants, the follow-on job is 
run. The follow-on job may create more children. 



Table 2 Crumble results for different sized simulated datasets and underlying alignment methods 









60 kb 




150 kb 




500 kb 




1000 kb 






Time 


Agreement 


Time 


Agreement 


Time 


Agreement 


Time 


Agreement 


Pecan 1 


3.43 


0.896 


10.6 


0.905 


46.9 


0.906 


100 


0.906 




Crumble w/Pecan 


60% 


3.29 


0.894 


7.18 


0.904 


21.5 


0.905 


51.9 


0.906 




30% 


2.56 


0.889 


4.66 


0.903 


11.9 


0.905 


23.5 


0.905 




15% 


2.39 


0.859 


3.77 


0.893 


8.29 


0.903 


13.9 


0.905 


FSA 2 




37.4 


0.886 


-0 


_<3 


_o 


_a 


_<3 


J3 


Crumble w/FSA 


60% 


25.8 


0.881 


69.8 


0.903 


-0 


_a 


J3 


J3 




30% 


21.0 


0.873 


3act9.2 


0.898 


_o 


_a 


-0 


_<3 




15% 


17.7 


0.849 


25.5 


0.893 


104. 


0.811 


-0 


_0 


MUSCLE 3 


_a 


_a 


_0 


_a 


_0 


_a 


_a 


_a 




Crumble w/MUSCLE 


60% 


_a 


_0 


_a 


_0 


J3 


_a 


_a 


_a 




30% 


128 


0.707 


_a 


_<3 


_o 


_a 


-0 


_o 




15% 


63.1 


0.679 


251. 


0.705 


_o 


_a 


-0 


a 



Pecan was run with default parameters. 

2 FSA was run with the -exonerate, -anchored, and -sof tmasked flags. 

3 MUSCLE was run with default parameters. 

a The majority of these problems were unable to be aligned due to running out of memory. 

The run-time and average agreement score of Crumble alignments of different sized datasets. Several sets of simulated alignment problems were generated 
using a root sequence of 60, 150, 500, and 1000 kilobases. The neutral evolution of each root sequence was simulated over a nine species tree. Fifty problems 
were generated per root size for a total of two hundred test alignment problems. The agreement and run-time (in minutes) for each problem size is the average 
over the fifty simulated alignments. Crumble was used to break the problems down to sub-problems that were 60%, 30%, and 15% of the length of the original 
problem. The approximate core size was set to 60%, 30%, and 15% of the length of the original problem and the block was allowed to be at most 4 kb larger as 
measured in any of the sequences. Pecan, FSA, and MUSCLE were used as the underlying alignment method. PrePecan was used to generate the constraints. We 
were unable to apply FSA directly (not using Crumble) to 150 kb or larger problems because FSA required more than the 4GBs of memory we had available per 
cluster node. Using Crumble we were able to run FSA on problems as large as half a megabase. MUSCLE had more memory issues but we were able to use it on 
problems as large as 150 kb using Crumble. For Pecan, Crumble achieved more than a seven fold speedup with almost no loss of accuracy on the largest 
problem size. 
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Prune 

A similar methodology was used to evaluate the align- 
ment accuracy achieved using Prune. The neutral evolu- 
tion of -10 kilobase of DNA on a 50, 100, 500, and 
1000 species tree was generated using the same simula- 
tion program used above. Trees were generated with 
Bio: :Tree: : RandomFactory module of BioPerl 
[28]. Lacking data on retrotransposon insertions into 
large clades, we disabled the retrotransposon modeling 
in the simulation program. To compensate for the lack 
of complexity, we greatly increased the tree length to 
187.8, 368.8, 709.2, and 925.9 substitutions per site for 
the 50, 100, 500, and 1000 leaf trees respectively. Pecan, 
FSA and MUSCLE were employed for the underlying 
alignment algorithm. We also compare Prune to 
MAFFT [29] and SATe [30] which specialize in many 
species multiple alignment problems. 

Unlike for Crumble, we do not expect to see a mono- 
tonic decrease in running time as Prune breaks the tree 
into smaller and smaller sub-trees. While smaller sub- 
trees will results in faster alignment of the sub-pro- 
blems, it can also result in more alignment stages. This 
is different from Crumble because the number of stages 
in Crumble is fixed. We have observed this non-mono- 
tonicity of running times (Table 3). Unexpectedly, we 
also see non-monotonicity in agreement as Prune breaks 
the tree into smaller and smaller sub-trees. In some 
cases, better agreement is actually achieved with smaller 
sub-trees. We hypothesize that, on deep alignments, 
some methods discard large amounts of information in 
order to fit the problem in memory. On smaller sub- 
trees more information can be retained which results in 
an increase agreement that outweighs any loss that 
comes from considering only a sub-tree. In general, as 
the sub-trees decreased in size, the run-time decreased 
significantly with only a small loss in accuracy. 
Job-tree 

Using the Job-tree system, we were able to apply Pecan 
to gigabase size alignment problems. The neutral evolu- 
tion of a one megabase root sequence was simulated on 
a 1000 species tree. Prune was used to break the tree 
into sub-trees no larger than 10 nodes. Crumble was 
then used to break each 10 species sub-problem into 
100 kilobase chucks that were then aligned with Pecan. 
All the sub-problems were then assembled to form a 
solution to the entire gigabase alignment problem. 

Each of the fifty gigabase alignment problems took 
just over nine hours to solve. No other tested alignment 
method was able to align problems of this magnitude. 
To calculate the average agreement score, we were 
forced to randomly sample the pairs of sequences 

instead of averaging over all ( 1000 ) = 499500 pairs. 



Using this method, the agreement score for these solu- 
tions was calculated as 0.754. Since no other methods 
were able to solve an alignment problem of this scale, 
we have nothing with which to compare. 

Biological Data 

The above simulation studies provide a comprehensive 
look at the effect Crumble and Prune have on perfor- 
mance. It is also enlightening to look at performance on 
"real world" datasets. 

The main problem with real world alignment pro- 
blems is evaluation. Unlike simulated data, there is no 
clear alignment with which to compare. For RNA and 
protein families, there does exist large, commonly used 
databases of multiple alignment problems that are used 
for benchmarking. The reference multiple alignments in 
these databases are usually hand curated multiple align- 
ments generated by an ensemble of alignment tools. 
These reference multiple alignments are commonly used 
as the "true alignment" and used to measure the quality 
of predicted alignments. We used alignment problems 
from the Rfam database of RNA multiple alignments to 
test Prune on real, biological data [23]. 

For genomic scale problems, there is currently no 
large, commonly used database of genomic scale multi- 
ple alignment problems. The sequences in Rfam are 
very short (<1.2 kb) and thus not useful for testing 
Crumble. Thus, we used a different method to evaluate 
Crumble on long alignment problems. We extracted and 
aligned a 90 kb region of the human genome and the 
orthologous region from six other species. To get an 
idea of alignment accuracy, we calculated the log-likeli- 
hood of the alignment. We used phyloFit from the 
PHAST package to calculate the maximum log-likeli- 
hood of the alignment [31]. While the tree topology was 
fixed, phyloFit was allowed to vary branch lengths when 
calculating the maximum log-likelihood of each align- 
ment. We used the REV base substitution model where 
gaps are treated as missing data [32]. While log-likeli- 
hood values closer to zero only represent higher likeli- 
hood that the alignment was generated by the given 
phylogenetic model, we believe the log-likelihood is a 
reasonable, if ad hoc, method of assessing alignment 
quality when no good reference alignment is available. 

The computations in this section were performed on a 
different cluster than the one used for the results of Sec- 
tion. That cluster has been reappropriated. The cluster 
used for this section has eight nodes each with two 
dual-core AMD Opteron 2214 HE running at 2.2 GHz 
and 32GB of memory. 
Crumble 

We examined Crumble's effect on performance by align- 
ing genomic DNA from 7 species. Looking at the UCSC 
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Table 3 Prune results for different sized datasets and underlying alignment methods 









50 leaves 




100 leaves 




500 leaves 




1000 leaves 






Time 


Agreement 


Time 


Agreement 


Time 


Agreement 


Time 


Agreement 


Pecan 1 




21.9 


0.914 


297. 


0.879 


_a 


-0 


_a 


_o 


Prune w/Pecan 


60% 


7.26 


0.880 


39.2 


0.862 


_a 


_a 


_a 


_o 




30% 


3.13 


0.909 


19.6 


0.839 


_a 


_c 


_a 


_a 




15% 


7.26 


0.912 


13.3 


0.878 


125. 


0.844 


_a 


_o 




7% 


4.24 


0.909 


13.5 


0.849 


29.1 


0.907 


122. 


0.877 


FSA 2 




63.1 


0.933 


266. 


0.856 


_a 


-0 


_a 


_o 


Prune w/FSA 


60% 


33.8 


0.912 


78.9 


0.838 


589. 


0.871 


_a 


-0 




30% 


10.5 


0.893 


23.8 


0.838 


142. 


0.879 


_a 


-0 




15% 


4.25 


0.885 


17.1 


0.857 


40.8 


0.877 


150. 


0.861 




7% 


3.00 


0.866 


4.23 


0.842 


12.7 


0.903 


34.8 


0.887 


MUSCLE 3 




55.6 


0.905 


138. 


0.799 


_b 


_b 


_b 


_b 


Prune w/MUSCLE 


60% 


40.7 


0.899 


77.9 


0.777 


886. 


0.862 


_b 


_b 




30% 


24.7 


0.896 


42.8 


0.777 


368. 


0.883 


_b 


_b 




15% 


15.1 


0.905 


29.1 


0.828 


185. 


0.899 


440. 


0.900 




7% 


24.7 


0.905 


18.8 


0.841 


114. 


0.924 


228 


0.928 


MAFFT 4 




3.17 


0.897 


5.39 


0.806 


20.1 


0.886 


25.2 


0.912 


SATe 5 




101. 


0.915 


301. 


0.840 


_b 


_b 


_b 


b 



1 Pecan was run with default parameters. 

2 FSA was run with the -exonerate, -anchored, -sof tmasked, and -fast flags. 

3 MUSCLE was run with default parameters. 

4 MAFFT was run with the -treein option. 

5 SATe was run with the -t option but limited to two iterations. We found that more iterations did almost nothing for accuracy. 
0 The majority of these problems were unable to be aligned due to running out of memory. 

h The majority of these problems took longer than 3 days and were aborted. 

The run-time and average agreement score of Prune alignments of different sized datasets. Several sets of simulated alignment problems were generated using a 
root sequence of 10 kilobases. The neutral evolution of each root sequence was simulated over 50, 100, 500, and 1000 species trees. Fifty problems were 
generated per tree size for a total of two hundred test alignment problems. The agreement and run-time (in minutes) for each problem size is the average over 
the fifty simulated alignments. Each underlying alignment method was tested on the dataset (Pecan, FSA, MUSCLE). Prune was then used to break the problems 
down into sub-trees that contained at most 60%, 30%, 15%, and 7% of the nodes in the entire tree. The largest number of stages was six but most of the 
problems had no more than 3 stages. Pecan, FSA, and MUSCLE were used as the underlying alignment method to Prune. We also performed alignment using 
MAFFT and SATe to compare against. To ensure a fair comparison, the true tree topology was passed to SATe (using -t option) and to MAFFT (using the poorly 
documented -treein option). We were unable to apply some alignment algorithms to large problems because of very long run-times and memory issues. 
Using Prune, we were able to use Pecan, FSA, and MUSCLE to solve alignment problems that were much deeper than could be solved without Prune. Prune 
achieved a very large speedup with little loss of accuracy and sometimes with an increase in accuracy. 



Genome browser, we selected a -90 kb region on the 
human genome, chrl4:104721193-104812803 [33]. This 
region was selected because it contained some but not 
an excessive number of rearrangements. Thus it was a 
reasonable problem for global alignment algorithms. 
Using the "Chain/Net" tracks [34], we found the best 
matching region in cow (bosTau4 assembly), dog 
(canFam2), mouse (mm9), chimp (panTro2), macaque 
(rheMac2), and rat (rn4) and extracted the corre- 
sponding sequence to form the alignment problem. 

As above, Crumble was used to break the alignment 
problem into several fractional sized pieces (Table 4) 
and Pecan [17], FSA [25], and MUSCLE [27] were used 
as the underlying alignment algorithms. The set of 
sparse constraints was generated with PrePecan. The 
log-likelihood of the alignment was used to measure 
alignment quality. As on the simulated data, Pecan run- 
ning by itself produced the best alignments under the 
log-likelihood measure. Crumble running on top of 



Pecan was able to halve the running time with a very 
small loss in log-likelihood. Crumble running with FSA 
achieved greater performance gains over FSA with even 
less decrease in the log-likelihood score. MUSCLE was 
unable to solve this alignment problem without the aid 
of Crumble. 
Prune 

To evaluate the performance of Prune on real world pro- 
blems, we considered alignment problems from the Rfam 
seed alignment database with more than 200 sequences 
per RNA family [23]. This criteria gave twelve families: 
tRNA, 5S_rRNA, SRP_bact, MIR807, Cobalamin, 
PK- G12 rRNA, SSU_rRNA_5, RNaseP_bact_a, 
tmRNA, SAM, U2, and U6. We tested the generated align- 
ments against the hand curated "seed" alignment main- 
tained by Rfam using the agreement score (Table 5). As 
for the simulated data in Table 3, we see non-monotoni- 
city in agreement score: as Prune breaks the tree into 
smaller and smaller sub-trees, the agreement score starts 
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Table 4 Crumble results for 90 kb of genomic DNA from 
seven species 







Time 


Log-likelihood 1 


Pecan 2 




11.3 


-0.354 


Crumble w/Pecan 


60% 


7.42 


-0.355 




30% 


4.67 


-0.357 




15% 


5.42 


-0.357 


FSA 3 




38.3 


-0.374 


Crumble w/FSA 


60% 


20.4 


-0.375 




30% 


12.2 


-0.375 




15% 


9.68 


-0.376 


MUSCLE 4 




_a 


_o 


Crumble w/MUSCLE 


60% 


_a 


-0 




30% 


153. 


-0.363 




15% 


59.2 


-0.367 



1 The log-likelihood of the alignment as calculated by phyloFit, in millions of 
nats. 

2 Pecan was run with default parameters. 

3 FSA was run with the -exonerate, -anchored, -sof tmasked, and 
-fast flags. 

4 MUSCLE was run with default parameters. 

5 This problem was unable to be aligned due to running out of memory. 
The run-time and log-likelihood score of Crumble alignments. Each underlying 
alignment method (Pecan, FSA, MUSCLE) was tested on the dataset. Crumble 
was then used to break the problem into sub-problems that were 
approximately 60%, 30%, and 15% of the length of the original problem. 
While MUSCLE was unable to align this problem directly, using Crumble we 
were able to apply it to this problem. 



to decrease but then begins to increase. Prune's perfor- 
mance on this data set is not as good as on simulated 
data. But it does significantly decrease run-time with only 
small to moderate loss of accuracy. 

Future Development 

The results presented here are for RNA and DNA align- 
ment problems. While Crumble and Prune can be used 
to align protein sequences, their performance on this 
problem remains to be measured. 

Crumble and Prune achieve their performance gains 
by leveraging smaller problem sizes and taking advan- 
tage of parallelization. The relative contributions of 
these two factors to overall performance has yet to be 
explored for the various alignment algorithms and pro- 
blem sizes. 

The next major step in the development of these 
methods is to adapt them to align regions that have 
undergone chromosomal translocations and inversions. 
The methods presented here can be used after programs 
such as Mercator [35] have unscrambled the region. A 
more integrated approach could better take advantage of 
the performance gains possible. 

We also aim to adapt these methods to work on var- 
ious cloud computing platforms such as Amazon's Elas- 
tic Compute Cloud. 



Table 5 Prune results for twelve alignment problems 
from the Rfam database 





Time 


Agree. 


Pecan 1 


_o 


_a 


Prune w/Pecan 60% 


_o 


_a 


30% 


14.6 


0.651 


15% 


5.35 


0.649 


7% 


2.57 


0.643 


FSA 2 


13.6 


0.792 


Prune w/FSA 60% 


10.3 


0.669 


30% 


4.30 


0.615 


15% 


2.39 


0.636 


7% 


2.17 


0.636 


MUSCLE 3 


3.67 


0.709 


Prune w/MUSCLE 60% 


3.03 


0.704 


30% 


1.23 


0.649 


15% 


1.03 


0.672 


7% 


1.42 


0.659 


MAFFT 4 


0.04 


0.693 


SATe 5 


93.9 


0.753 



1 Pecan was run with default parameters. 

2 FSA was run with the -exonerate, -anchored, -sof tmasked, and 
-fast flags. 

3 MUSCLE was run with default parameters. 

4 MAFFT was run with -treein option. 

5 SATe was run with the -t option but limited to two iterations. We found 
that more iterations did almost nothing for accuracy. 

a The majority of problems were unable to be aligned due to running out of 
memory. 

The run-time and agreement score of Prune alignments of twelve RNA 
alignment problems from the Rfam database. The average time and 
agreement over all twelve problems are shown. Pecan, FSA, and MUSCLE 
were used as the underlying alignment method of Prune. MAFFT and SATe 
were also tested to provide comparison. We were unable to apply Pecan 
without using Prune because of memory issues. Using Prune, we were able to 
use Pecan to solve these alignment problems. Prune achieved a very large 
speedup with little loss of accuracy. Other alignment methods achieved a 
large speedup but more accuracy was lost. 

Conclusions 

We have presented two general methods, Crumble and 
Prune, for improving the running time of alignment 
programs. The methods work by breaking large align- 
ment problems into smaller sub-problems, solving those 
sub-problems, and reassembling them to form the full 
alignment. The sub-problems are formed so that many 
of them can be solved in parallel. This allows modern 
computer cluster systems to be leveraged to solve large 
alignment problems. 

We have tested Crumble and Prune on a very large set 
of simulated alignment problems. The test dataset includes 
both long (~1 megabase of sequence) and deep (1000 spe- 
cies) alignment problems. Crumble and Prune were able 
to dramatically improve the run-time of Pecan, FSA, and 
MUSCLE on long and deep alignment problems with very 
little loss in alignment accuracy. In some cases, Prune was 
also able to improve the accuracy of FSA and MUSCLE 
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while, at the same time, providing a boost in performance. 
We also tested Crumble and Prune on a set of biological 
data. While these data sets are relatively small with respect 
to the scale of problems that Crumble and Prune are 
designed to handle, they show that our methods do pro- 
vide a significant performance improvement with only 
moderate loss of accuracy. 

With Crumble and Prune we were able to apply 
Pecan, FSA, and MUSCLE to much longer and deeper 
problems than could be solved by running those pro- 
grams without Prune because of memory or time con- 
straints. This extends the applicability of these methods 
to larger alignment problems. 

We believe that these methods will enable the application 
of more sophisticated and statistically motivated alignment 
algorithms toward large, real world alignment problems. 

Availability 

♦ Project name: Crumble, Prune, Job-tree 

♦ Project home page: http://hgwdev.cse.ucsc.edu/ 
~krish/crumble_prune/. Datasets used to test the 
alignments are available at: http://hgwdev.cse.ucsc. 
edu/ ~krish/test_alignments/. 

♦ Operating System(s): Linux 2.6.18 

♦ Programming language: C++, Python 

♦ Other requirements: Boost C++ Libraries 1.46 
. License: GNU GPL 

Crumble, Prune, and Job-tree are licensed under the 
GPL and and available for download at: http://hgwdev. 
cse.ucsc.edu/~krish/crumble_prune/. The datasets used 
to test the alignments are available at: http://hgwdev.cse. 
ucsc.edu/ -krish/ test_alignments/. 
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